Assignment 03

Author
Affiliation

Emily Sundberg

Boston University

Published

September 24, 2025

Modified

September 26, 2025

1 Load the Data Set

import pandas as pd
import plotly.express as px
import plotly.io as pio
from pyspark.sql import SparkSession
import re
import numpy as np
import plotly.graph_objects as go
from pyspark.sql.functions import col, split, explode, regexp_replace, transform, when
from pyspark.sql import functions as F
from pyspark.sql.functions import col, monotonically_increasing_id

np.random.seed(42)

pio.renderers.default = "notebook"

spark = SparkSession.builder.appName("LightcastData").getOrCreate()

df = spark.read.option("header", "true").option("inferSchema", "true").option("multiLine","true").option("escape", "\"").csv("/home/ubuntu/assignment-03-emily-sundberg-1/data/lightcast_job_postings.csv")
df.createOrReplaceTempView("job_postings")

#print("---This is Diagnostic check, No need to print it in the final doc---")

#df.printSchema() # comment this line when rendering the submission
df.show(5)
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/09/26 23:41:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[Stage 1:>                                                          (0 + 1) / 1]                                                                                25/09/26 23:42:10 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
[Stage 2:>                                                          (0 + 1) / 1]                                                                                
+--------------------+-----------------+----------------------+----------+--------+---------+--------+--------------------+--------------------+--------------------+-----------+-------------------+--------------------+--------------------+---------------+----------------+--------+--------------------+-----------+-------------------+----------------+---------------------+-------------+-------------------+-------------+------------------+---------------+--------------------+--------------------+--------------------+-------------+------+-----------+----------------+-------------------+---------+-----------+--------------------+--------------------+-------------+------+--------------+-----+--------------------+-----+----------+---------------+--------------------+---------------+--------------------+------------+--------------------+------------+--------------------+------+--------------------+------+--------------------+------+--------------------+------+--------------------+------+--------------------+------------------+-------------------+--------------------+--------------------+--------------------+--------------------+-----------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+--------------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+--------------------+----------+--------------------+----------+---------------+----------+---------------+---------------+--------------------+--------------+--------------------+--------------------------+-------------------------------+--------------------+-------------------------+-----------------------------+----------------------------------+-----------------+----------------------+-----------------------+----------------------------+------------------+-----------------------+-------+--------------------+-------+--------------------+-------+---------------+-------+---------------+-----------------+----------------------+------------+--------------------+------------+--------------------+------------+--------------------+------------+--------------------+------------+--------------------+
|                  ID|LAST_UPDATED_DATE|LAST_UPDATED_TIMESTAMP|DUPLICATES|  POSTED|  EXPIRED|DURATION|        SOURCE_TYPES|             SOURCES|                 URL|ACTIVE_URLS|ACTIVE_SOURCES_INFO|           TITLE_RAW|                BODY|MODELED_EXPIRED|MODELED_DURATION| COMPANY|        COMPANY_NAME|COMPANY_RAW|COMPANY_IS_STAFFING|EDUCATION_LEVELS|EDUCATION_LEVELS_NAME|MIN_EDULEVELS| MIN_EDULEVELS_NAME|MAX_EDULEVELS|MAX_EDULEVELS_NAME|EMPLOYMENT_TYPE|EMPLOYMENT_TYPE_NAME|MIN_YEARS_EXPERIENCE|MAX_YEARS_EXPERIENCE|IS_INTERNSHIP|SALARY|REMOTE_TYPE|REMOTE_TYPE_NAME|ORIGINAL_PAY_PERIOD|SALARY_TO|SALARY_FROM|            LOCATION|                CITY|    CITY_NAME|COUNTY|   COUNTY_NAME|  MSA|            MSA_NAME|STATE|STATE_NAME|COUNTY_OUTGOING|COUNTY_NAME_OUTGOING|COUNTY_INCOMING|COUNTY_NAME_INCOMING|MSA_OUTGOING|   MSA_NAME_OUTGOING|MSA_INCOMING|   MSA_NAME_INCOMING|NAICS2|         NAICS2_NAME|NAICS3|         NAICS3_NAME|NAICS4|         NAICS4_NAME|NAICS5|         NAICS5_NAME|NAICS6|         NAICS6_NAME|             TITLE|         TITLE_NAME|         TITLE_CLEAN|              SKILLS|         SKILLS_NAME|  SPECIALIZED_SKILLS|SPECIALIZED_SKILLS_NAME|      CERTIFICATIONS| CERTIFICATIONS_NAME|       COMMON_SKILLS|  COMMON_SKILLS_NAME|     SOFTWARE_SKILLS|SOFTWARE_SKILLS_NAME|      ONET|           ONET_NAME| ONET_2019|      ONET_2019_NAME|                CIP6|           CIP6_NAME|                CIP4|           CIP4_NAME|                CIP2|           CIP2_NAME|SOC_2021_2|     SOC_2021_2_NAME|SOC_2021_3|     SOC_2021_3_NAME|SOC_2021_4|SOC_2021_4_NAME|SOC_2021_5|SOC_2021_5_NAME|LOT_CAREER_AREA|LOT_CAREER_AREA_NAME|LOT_OCCUPATION| LOT_OCCUPATION_NAME|LOT_SPECIALIZED_OCCUPATION|LOT_SPECIALIZED_OCCUPATION_NAME|LOT_OCCUPATION_GROUP|LOT_OCCUPATION_GROUP_NAME|LOT_V6_SPECIALIZED_OCCUPATION|LOT_V6_SPECIALIZED_OCCUPATION_NAME|LOT_V6_OCCUPATION|LOT_V6_OCCUPATION_NAME|LOT_V6_OCCUPATION_GROUP|LOT_V6_OCCUPATION_GROUP_NAME|LOT_V6_CAREER_AREA|LOT_V6_CAREER_AREA_NAME|  SOC_2|          SOC_2_NAME|  SOC_3|          SOC_3_NAME|  SOC_4|     SOC_4_NAME|  SOC_5|     SOC_5_NAME|LIGHTCAST_SECTORS|LIGHTCAST_SECTORS_NAME|NAICS_2022_2|   NAICS_2022_2_NAME|NAICS_2022_3|   NAICS_2022_3_NAME|NAICS_2022_4|   NAICS_2022_4_NAME|NAICS_2022_5|   NAICS_2022_5_NAME|NAICS_2022_6|   NAICS_2022_6_NAME|
+--------------------+-----------------+----------------------+----------+--------+---------+--------+--------------------+--------------------+--------------------+-----------+-------------------+--------------------+--------------------+---------------+----------------+--------+--------------------+-----------+-------------------+----------------+---------------------+-------------+-------------------+-------------+------------------+---------------+--------------------+--------------------+--------------------+-------------+------+-----------+----------------+-------------------+---------+-----------+--------------------+--------------------+-------------+------+--------------+-----+--------------------+-----+----------+---------------+--------------------+---------------+--------------------+------------+--------------------+------------+--------------------+------+--------------------+------+--------------------+------+--------------------+------+--------------------+------+--------------------+------------------+-------------------+--------------------+--------------------+--------------------+--------------------+-----------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+--------------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+--------------------+----------+--------------------+----------+---------------+----------+---------------+---------------+--------------------+--------------+--------------------+--------------------------+-------------------------------+--------------------+-------------------------+-----------------------------+----------------------------------+-----------------+----------------------+-----------------------+----------------------------+------------------+-----------------------+-------+--------------------+-------+--------------------+-------+---------------+-------+---------------+-----------------+----------------------+------------+--------------------+------------+--------------------+------------+--------------------+------------+--------------------+------------+--------------------+
|1f57d95acf4dc67ed...|         9/6/2024|  2024-09-06 20:32:...|         0|6/2/2024| 6/8/2024|       6|   [\n  "Company"\n]|[\n  "brassring.c...|[\n  "https://sjo...|         []|               NULL|Enterprise Analys...|31-May-2024\n\nEn...|       6/8/2024|               6|  894731|          Murphy USA| Murphy USA|              false|       [\n  2\n]| [\n  "Bachelor's ...|            2|  Bachelor's degree|         NULL|              NULL|              1|Full-time (> 32 h...|                   2|                   2|        false|  NULL|          0|          [None]|               NULL|     NULL|       NULL|{\n  "lat": 33.20...|RWwgRG9yYWRvLCBBUg==|El Dorado, AR|  5139|     Union, AR|20980|       El Dorado, AR|    5|  Arkansas|           5139|           Union, AR|           5139|           Union, AR|       20980|       El Dorado, AR|       20980|       El Dorado, AR|    44|        Retail Trade|   441|Motor Vehicle and...|  4413|Automotive Parts,...| 44133|Automotive Parts ...|441330|Automotive Parts ...|ET29C073C03D1F86B4|Enterprise Analysts|enterprise analys...|[\n  "KS126DB6T06...|[\n  "Merchandisi...|[\n  "KS126DB6T06...|   [\n  "Merchandisi...|                  []|                  []|[\n  "KS126706DPF...|[\n  "Mathematics...|[\n  "KS440W865GC...|[\n  "SQL (Progra...|15-2051.01|Business Intellig...|15-2051.01|Business Intellig...|[\n  "45.0601",\n...|[\n  "Economics, ...|[\n  "45.06",\n  ...|[\n  "Economics",...|[\n  "45",\n  "27...|[\n  "Social Scie...|   15-0000|Computer and Math...|   15-2000|Mathematical Scie...|   15-2050|Data Scientists|   15-2051|Data Scientists|             23|Information Techn...|        231010|Business Intellig...|                  23101011|           General ERP Analy...|                2310|     Business Intellig...|                     23101011|              General ERP Analy...|           231010|  Business Intellig...|                   2310|        Business Intellig...|                23|   Information Techn...|15-0000|Computer and Math...|15-2000|Mathematical Scie...|15-2050|Data Scientists|15-2051|Data Scientists|        [\n  7\n]|  [\n  "Artificial ...|          44|        Retail Trade|         441|Motor Vehicle and...|        4413|Automotive Parts,...|       44133|Automotive Parts ...|      441330|Automotive Parts ...|
|0cb072af26757b6c4...|         8/2/2024|  2024-08-02 17:08:...|         0|6/2/2024| 8/1/2024|    NULL| [\n  "Job Board"\n]| [\n  "maine.gov"\n]|[\n  "https://job...|         []|               NULL|Oracle Consultant...|Oracle Consultant...|       8/1/2024|            NULL|  133098|Smx Corporation L...|        SMX|               true|      [\n  99\n]| [\n  "No Educatio...|           99|No Education Listed|         NULL|              NULL|              1|Full-time (> 32 h...|                   3|                   3|        false|  NULL|          1|          Remote|               NULL|     NULL|       NULL|{\n  "lat": 44.31...|    QXVndXN0YSwgTUU=|  Augusta, ME| 23011|  Kennebec, ME|12300|Augusta-Watervill...|   23|     Maine|          23011|        Kennebec, ME|          23011|        Kennebec, ME|       12300|Augusta-Watervill...|       12300|Augusta-Watervill...|    56|Administrative an...|   561|Administrative an...|  5613| Employment Services| 56132|Temporary Help Se...|561320|Temporary Help Se...|ET21DDA63780A7DC09| Oracle Consultants|oracle consultant...|[\n  "KS122626T55...|[\n  "Procurement...|[\n  "KS122626T55...|   [\n  "Procurement...|                  []|                  []|                  []|                  []|[\n  "BGSBF3F508F...|[\n  "Oracle Busi...|15-2051.01|Business Intellig...|15-2051.01|Business Intellig...|                  []|                  []|                  []|                  []|                  []|                  []|   15-0000|Computer and Math...|   15-2000|Mathematical Scie...|   15-2050|Data Scientists|   15-2051|Data Scientists|             23|Information Techn...|        231010|Business Intellig...|                  23101012|           Oracle Consultant...|                2310|     Business Intellig...|                     23101012|              Oracle Consultant...|           231010|  Business Intellig...|                   2310|        Business Intellig...|                23|   Information Techn...|15-0000|Computer and Math...|15-2000|Mathematical Scie...|15-2050|Data Scientists|15-2051|Data Scientists|             NULL|                  NULL|          56|Administrative an...|         561|Administrative an...|        5613| Employment Services|       56132|Temporary Help Se...|      561320|Temporary Help Se...|
|85318b12b3331fa49...|         9/6/2024|  2024-09-06 20:32:...|         1|6/2/2024| 7/7/2024|      35| [\n  "Job Board"\n]|[\n  "dejobs.org"\n]|[\n  "https://dej...|         []|               NULL|        Data Analyst|Taking care of pe...|      6/10/2024|               8|39063746|            Sedgwick|   Sedgwick|              false|       [\n  2\n]| [\n  "Bachelor's ...|            2|  Bachelor's degree|         NULL|              NULL|              1|Full-time (> 32 h...|                   5|                NULL|        false|  NULL|          0|          [None]|               NULL|     NULL|       NULL|{\n  "lat": 32.77...|    RGFsbGFzLCBUWA==|   Dallas, TX| 48113|    Dallas, TX|19100|Dallas-Fort Worth...|   48|     Texas|          48113|          Dallas, TX|          48113|          Dallas, TX|       19100|Dallas-Fort Worth...|       19100|Dallas-Fort Worth...|    52|Finance and Insur...|   524|Insurance Carrier...|  5242|Agencies, Brokera...| 52429|Other Insurance R...|524291|    Claims Adjusting|ET3037E0C947A02404|      Data Analysts|        data analyst|[\n  "KS1218W78FG...|[\n  "Management"...|[\n  "ESF3939CE1F...|   [\n  "Exception R...|[\n  "KS683TN76T7...|[\n  "Security Cl...|[\n  "KS1218W78FG...|[\n  "Management"...|[\n  "KS126HY6YLT...|[\n  "Microsoft O...|15-2051.01|Business Intellig...|15-2051.01|Business Intellig...|                  []|                  []|                  []|                  []|                  []|                  []|   15-0000|Computer and Math...|   15-2000|Mathematical Scie...|   15-2050|Data Scientists|   15-2051|Data Scientists|             23|Information Techn...|        231113|Data / Data Minin...|                  23111310|                   Data Analyst|                2311|     Data Analysis and...|                     23111310|                      Data Analyst|           231113|  Data / Data Minin...|                   2311|        Data Analysis and...|                23|   Information Techn...|15-0000|Computer and Math...|15-2000|Mathematical Scie...|15-2050|Data Scientists|15-2051|Data Scientists|             NULL|                  NULL|          52|Finance and Insur...|         524|Insurance Carrier...|        5242|Agencies, Brokera...|       52429|Other Insurance R...|      524291|    Claims Adjusting|
|1b5c3941e54a1889e...|         9/6/2024|  2024-09-06 20:32:...|         1|6/2/2024|7/20/2024|      48| [\n  "Job Board"\n]|[\n  "disabledper...|[\n  "https://www...|         []|               NULL|Sr. Lead Data Mgm...|About this role:\...|      6/12/2024|              10|37615159|         Wells Fargo|Wells Fargo|              false|      [\n  99\n]| [\n  "No Educatio...|           99|No Education Listed|         NULL|              NULL|              1|Full-time (> 32 h...|                   3|                NULL|        false|  NULL|          0|          [None]|               NULL|     NULL|       NULL|{\n  "lat": 33.44...|    UGhvZW5peCwgQVo=|  Phoenix, AZ|  4013|  Maricopa, AZ|38060|Phoenix-Mesa-Chan...|    4|   Arizona|           4013|        Maricopa, AZ|           4013|        Maricopa, AZ|       38060|Phoenix-Mesa-Chan...|       38060|Phoenix-Mesa-Chan...|    52|Finance and Insur...|   522|Credit Intermedia...|  5221|Depository Credit...| 52211|  Commercial Banking|522110|  Commercial Banking|ET2114E0404BA30075|Management Analysts|sr lead data mgmt...|[\n  "KS123QX62QY...|[\n  "Exit Strate...|[\n  "KS123QX62QY...|   [\n  "Exit Strate...|                  []|                  []|[\n  "KS7G6NP6R6L...|[\n  "Reliability...|[\n  "KS4409D76NW...|[\n  "SAS (Softwa...|15-2051.01|Business Intellig...|15-2051.01|Business Intellig...|                  []|                  []|                  []|                  []|                  []|                  []|   15-0000|Computer and Math...|   15-2000|Mathematical Scie...|   15-2050|Data Scientists|   15-2051|Data Scientists|             23|Information Techn...|        231113|Data / Data Minin...|                  23111310|                   Data Analyst|                2311|     Data Analysis and...|                     23111310|                      Data Analyst|           231113|  Data / Data Minin...|                   2311|        Data Analysis and...|                23|   Information Techn...|15-0000|Computer and Math...|15-2000|Mathematical Scie...|15-2050|Data Scientists|15-2051|Data Scientists|        [\n  6\n]|  [\n  "Data Privac...|          52|Finance and Insur...|         522|Credit Intermedia...|        5221|Depository Credit...|       52211|  Commercial Banking|      522110|  Commercial Banking|
|cb5ca25f02bdf25c1...|        6/19/2024|   2024-06-19 07:00:00|         0|6/2/2024|6/17/2024|      15|[\n  "FreeJobBoar...|[\n  "craigslist....|[\n  "https://mod...|         []|               NULL|Comisiones de $10...|Comisiones de $10...|      6/17/2024|              15|       0|        Unclassified|      LH/GM|              false|      [\n  99\n]| [\n  "No Educatio...|           99|No Education Listed|         NULL|              NULL|              3|Part-time / full-...|                NULL|                NULL|        false| 92500|          0|          [None]|               year|   150000|      35000|{\n  "lat": 37.63...|    TW9kZXN0bywgQ0E=|  Modesto, CA|  6099|Stanislaus, CA|33700|         Modesto, CA|    6|California|           6099|      Stanislaus, CA|           6099|      Stanislaus, CA|       33700|         Modesto, CA|       33700|         Modesto, CA|    99|Unclassified Indu...|   999|Unclassified Indu...|  9999|Unclassified Indu...| 99999|Unclassified Indu...|999999|Unclassified Indu...|ET0000000000000000|       Unclassified|comisiones de por...|                  []|                  []|                  []|                     []|                  []|                  []|                  []|                  []|                  []|                  []|15-2051.01|Business Intellig...|15-2051.01|Business Intellig...|                  []|                  []|                  []|                  []|                  []|                  []|   15-0000|Computer and Math...|   15-2000|Mathematical Scie...|   15-2050|Data Scientists|   15-2051|Data Scientists|             23|Information Techn...|        231010|Business Intellig...|                  23101012|           Oracle Consultant...|                2310|     Business Intellig...|                     23101012|              Oracle Consultant...|           231010|  Business Intellig...|                   2310|        Business Intellig...|                23|   Information Techn...|15-0000|Computer and Math...|15-2000|Mathematical Scie...|15-2050|Data Scientists|15-2051|Data Scientists|             NULL|                  NULL|          99|Unclassified Indu...|         999|Unclassified Indu...|        9999|Unclassified Indu...|       99999|Unclassified Indu...|      999999|Unclassified Indu...|
+--------------------+-----------------+----------------------+----------+--------+---------+--------+--------------------+--------------------+--------------------+-----------+-------------------+--------------------+--------------------+---------------+----------------+--------+--------------------+-----------+-------------------+----------------+---------------------+-------------+-------------------+-------------+------------------+---------------+--------------------+--------------------+--------------------+-------------+------+-----------+----------------+-------------------+---------+-----------+--------------------+--------------------+-------------+------+--------------+-----+--------------------+-----+----------+---------------+--------------------+---------------+--------------------+------------+--------------------+------------+--------------------+------+--------------------+------+--------------------+------+--------------------+------+--------------------+------+--------------------+------------------+-------------------+--------------------+--------------------+--------------------+--------------------+-----------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+--------------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+--------------------+----------+--------------------+----------+---------------+----------+---------------+---------------+--------------------+--------------+--------------------+--------------------------+-------------------------------+--------------------+-------------------------+-----------------------------+----------------------------------+-----------------+----------------------+-----------------------+----------------------------+------------------+-----------------------+-------+--------------------+-------+--------------------+-------+---------------+-------+---------------+-----------------+----------------------+------------+--------------------+------------+--------------------+------------+--------------------+------------+--------------------+------------+--------------------+
only showing top 5 rows

2 Data Cleaning

2.1 Casting salary and experience columns

df = df.withColumn("SALARY_FROM", col ("SALARY_FROM").cast("float"))\
  .withColumn("SALARY_TO", col("SALARY_TO").cast("float")) \
  .withColumn("MAX_YEARS_EXPERIENCE", col("MAX_YEARS_EXPERIENCE").cast("float"))\
  .withColumn("MIN_YEARS_EXPERIENCE", col("MIN_YEARS_EXPERIENCE").cast("float"))\
  .withColumn("SALARY", col("SALARY").cast("float"))

2.2 Computing medians for salary columns

def compute_median(sdf, col_name):
  q = sdf.approxQuantile(col_name, [0.5], 0.01)
  return q[0] if q else None


median_from = compute_median(df, "SALARY_FROM")
median_to = compute_median(df, "SALARY_TO")
median_salary = compute_median(df,"SALARY")

print("Medians:", median_from, median_to, median_salary)
[Stage 3:>                                                          (0 + 1) / 1]                                                                                [Stage 4:>                                                          (0 + 1) / 1]                                                                                [Stage 5:>                                                          (0 + 1) / 1]
Medians: 87295.0 130042.0 115024.0
                                                                                

2.3 Impute missing salaries

df = df.fillna({
  "SALARY_FROM": median_from,
  "SALARY_TO": median_to,
  "SALARY": median_salary
})

2.4 Cleaning Education Column

df = df.withColumn("EDUCATION_LEVELS_NAME", regexp_replace(col("EDUCATION_LEVELS_NAME"), "[\n\r]", ""))
ed = df.select("EDUCATION_LEVELS_NAME")
ed.show(15)
+---------------------+
|EDUCATION_LEVELS_NAME|
+---------------------+
| [  "Bachelor's de...|
| [  "No Education ...|
| [  "Bachelor's de...|
| [  "No Education ...|
| [  "No Education ...|
| [  "Bachelor's de...|
| [  "Bachelor's de...|
| [  "Bachelor's de...|
| [  "No Education ...|
| [  "Bachelor's de...|
| [  "High school o...|
| [  "No Education ...|
| [  "Bachelor's de...|
| [  "Bachelor's de...|
| [  "No Education ...|
+---------------------+
only showing top 15 rows

2.5 Compute Average Salary

df = df.withColumn("AVG_SALARY", (col("SALARY_FROM")+col("SALARY_TO"))/2)

2.6 Exporting Cleaned Data

df_selected = df.select(
  "EDUCATION_LEVELS_NAME",
  "REMOTE_TYPE_NAME",
  "MAX_YEARS_EXPERIENCE",
  "AVG_SALARY",
  "LOT_V6_SPECIALIZED_OCCUPATION_NAME",
  "NAICS2_NAME")

2.7 Save to CSV

pdf = df_selected.toPandas()

pdf.to_csv("./data/lightcast_cleaned.csv", index=False)

print("Data Cleaning Complete. Rows retained:", len(pdf))
[Stage 7:>                                                          (0 + 1) / 1]                                                                                
Data Cleaning Complete. Rows retained: 72498

3 Salary Distribution by Industry and Employment Type

3.1 Remove records where salary is missing or zero

pdf = df.filter(df["SALARY"] >0).select("EMPLOYMENT_TYPE_NAME", "SALARY","NAICS2_NAME").toPandas()
pdf["EMPLOYMENT_TYPE_NAME"] = pdf["EMPLOYMENT_TYPE_NAME"].fillna("Unknown")
pdf["EMPLOYMENT_TYPE_NAME"] = pdf["EMPLOYMENT_TYPE_NAME"].apply(lambda x: re.sub(r"[^\x00-\x7F]+","",x))

pdf.head()
[Stage 8:>                                                          (0 + 1) / 1]                                                                                
EMPLOYMENT_TYPE_NAME SALARY NAICS2_NAME
0 Full-time (> 32 hours) 115024.0 Retail Trade
1 Full-time (> 32 hours) 115024.0 Administrative and Support and Waste Managemen...
2 Full-time (> 32 hours) 115024.0 Finance and Insurance
3 Full-time (> 32 hours) 115024.0 Finance and Insurance
4 Part-time / full-time 92500.0 Unclassified Industry

3.2 Aggregate Data

median_salaries = pdf.groupby("EMPLOYMENT_TYPE_NAME")["SALARY"].median()

sorted_employment_types = median_salaries.sort_values(ascending = False).index

pdf["EMPLOYMENT_TYPE_NAME"] = pd.Categorical(
  pdf["EMPLOYMENT_TYPE_NAME"],
  categories=sorted_employment_types,
  ordered=True
)

3.3 Visualize Results

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

fig = px.box(
  pdf,
  x = "NAICS2_NAME",
  y = "SALARY",
  title = "Salary Distribution by NAICS2 Name",
  color_discrete_sequence = ["peru"],
  boxmode = "group",
  points = "outliers" 
)
fig.update_layout(
  margin=dict(t=150),
  width=2000, 
  height=800,
  title=dict(
    text = "Salary Distribution By NAICS2 Name",
    font=dict(size=30, family="Montserrat", color = "saddlebrown", weight="bold")
  ),

  xaxis=dict(
    title=dict(text="NAICS2 Name", font=dict(size=14, family="Montserrat", color="saddlebrown", weight="bold")),
    tickangle=50,
    showline=True,
    linewidth=2,
    linecolor="saddlebrown",
    mirror=True,
    showgrid=False,
    categoryorder="array",
    categoryarray=sorted_employment_types.tolist()
  ),

  yaxis=dict(
    title=dict(text="Salary (K $)", font=dict(size=14, family="Montserrat", color="saddlebrown", weight="bold")),
    tickvals=[0,50000,100000,150000,200000,250000,300000,350000,400000,450000,500000],
    ticktext=["0","50K","100K","150K","200K","250K","300K","350K","400K","450K","500K"],
    tickfont=dict(size=12, family="Montserrat", color="saddlebrown",weight="bold"),
    showline=True,
    linewidth=2,
    linecolor="saddlebrown",
    mirror=True,
    showgrid=True,
    gridcolor="tan",
    gridwidth=0.5
  )
)

fig.show()

The plot above shows us that the medians of these industries average around 100K but there are several career fields with the opportunity to make significantly more. Both admin support and information have outliers in the $500K range.

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

fig = px.box(
  pdf,
  x = "EMPLOYMENT_TYPE_NAME",
  y = "SALARY",
  title = "Salary Distribution by Employment Type",
  color_discrete_sequence = ["peru"],
  boxmode = "group",
  points = "outliers" 
)
fig.update_layout(
  margin=dict(t=150),
  width=2000, 
  height=800,
  title=dict(
    text = "Salary Distribution By Employment Type",
    font=dict(size=30, family="Montserrat", color = "saddlebrown", weight="bold")
  ),

  xaxis=dict(
    title=dict(text="Employment Type", font=dict(size=14, family="Montserrat", color="saddlebrown", weight="bold")),
    tickangle=50,
    showline=True,
    linewidth=2,
    linecolor="saddlebrown",
    mirror=True,
    showgrid=False,
    categoryorder="array",
    categoryarray=sorted_employment_types.tolist()
  ),

  yaxis=dict(
    title=dict(text="Salary (K $)", font=dict(size=14, family="Montserrat", color="saddlebrown", weight="bold")),
    tickvals=[0,50000,100000,150000,200000,250000,300000,350000,400000,450000,500000],
    ticktext=["0","50K","100K","150K","200K","250K","300K","350K","400K","450K","500K"],
    tickfont=dict(size=12, family="Montserrat", color="saddlebrown",weight="bold"),
    showline=True,
    linewidth=2,
    linecolor="saddlebrown",
    mirror=True,
    showgrid=True,
    gridcolor="tan",
    gridwidth=0.5
  )
)

fig.show()

4 Salary Analysis by ONET Occupation Type

4.1 Aggregate Data

salary_analysis = spark.sql("""
  SELECT
    LOT_OCCUPATION_NAME AS Occupation_Name,
    PERCENTILE(SALARY, 0.5) AS Median_Salary,
    Count(*) AS Job_Postings
  FROM job_postings
  GROUP BY LOT_OCCUPATION_NAME
  ORDER BY Job_Postings DESC
  LIMIT 10
""")

salary_pd = salary_analysis.toPandas()
salary_pd.head()
[Stage 9:>                                                          (0 + 1) / 1]                                                                                
Occupation_Name Median_Salary Job_Postings
0 Data / Data Mining Analyst 95250.0 30057
1 Business Intelligence Analyst 125900.0 29445
2 Computer Systems Engineer / Architect 157600.0 8212
3 Business / Management Analyst 93650.0 4326
4 Clinical Analyst / Clinical Documentation and ... 89440.0 261

4.2 Bubble Plot

fig = px.scatter(
  salary_pd,
  x="Occupation_Name",
  y="Median_Salary",
  size="Job_Postings",
  title = "Salary Analysis by Occupation (Bubble Chart)",
  labels={
    "Occupation_Name": "Occupation",
    "Median_Salary" : "Median Salary",
    "Job_Postings" : "Number of Job Postings"
  },
  hover_name = "Occupation_Name",
  size_max=60,
  width=1000,
  height=1000,
  color="Job_Postings",
  color_continuous_scale="Sunsetdark"
)
fig.update_layout(
  margin=dict(t=150),
  font_family="Montserrat",
  font_size = 14,
  title_font_size = 25,
  xaxis_title = "Occupation",
  yaxis_title = "Median Salary",
  plot_bgcolor = "white",
  xaxis=dict(
    tickangle=-45,
    showline=True,
    linecolor="black"
  ),
  yaxis=dict(
    showline=True,
    linecolor="black"
  )
)

fig.show()

This figure shows us that, although data analysts and business analysts have roughly the same median salary, data analysts have far more job postings than their business analyst counterparts.

5 Salary by Education Level

5.1 Create Education Groups

lower_degree = ["Bachelor's", "Associate","GED", "No Education Listed", "High School"]

higher_degree = ["Master's Degree", "PhD or professional degree"]

df = df.withColumn(
  "EDUCATION_GROUP",
  when(col("EDUCATION_LEVELS_NAME").rlike("|".join([f"(?i){deg}"for deg in lower_degree])), "Bachelor's or lower")
  .when(col("EDUCATION_LEVELS_NAME").rlike("|".join([f"(?i){deg}"for deg in higher_degree])), "Master's or PhD")
  .otherwise("Other")
)

df = df.filter(
  col("MAX_YEARS_EXPERIENCE").isNotNull() &
  col("AVG_SALARY").isNotNull() &
  (col("MAX_YEARS_EXPERIENCE")>0) &
  (col("AVG_SALARY") > 0)
)

df_filtered = df.filter(col("EDUCATION_GROUP").isin("Bachelor's or lower","Master's or PhD"))

df_pd = df_filtered.toPandas()
[Stage 12:>                                                         (0 + 1) / 1]                                                                                

5.2 Scatter Plot

fig = px.scatter(
  df_pd,
  x = "MAX_YEARS_EXPERIENCE",
  y = "AVG_SALARY",
  color = "EDUCATION_GROUP",
  hover_data=["LOT_V6_SPECIALIZED_OCCUPATION_NAME"],
  title = "Experience vs Salary by Education Level",
  opacity=0.7,
  color_discrete_sequence= ["palevioletred", "mediumseagreen"]
)

fig.update_traces(marker=dict(size=7, line=dict(width=1, color="black")))

fig.update_layout(
  plot_bgcolor = "papayawhip",
  font=dict(family = "Montserrat", size = 14),
  title_font = dict(size = 22, weight = "bold"),
  xaxis_title = "Years of Experience",
  yaxis_title = "Average Salary (USD)",
  legend_title = "Education Group",
  hoverlabel = dict(bgcolor = "white", font_size = 13, font_family = "Montserrat"),
  margin=dict(t=70, l=60, r=60),
  xaxis=dict(
    gridcolor="peru",
    tickmode = 'linear',
    dtick=1
  ),
  yaxis=dict(gridcolor = "peru")
)

fig.show()

This graph is showing that as years of experience increases, the floor for average salary also increases. Additionally, it shows that Bachelor’s or lower appears to have significantly more outliers while Master’s or PhD is more consistent.

6 Salary By Remote Work Type

6.1 Split Work Type Groups

from pyspark.sql.functions import when, col, trim

df = df.withColumn("REMOTE_GROUP",
  when(trim(col("REMOTE_TYPE_NAME"))== "Remote", "Remote")
  .when(trim(col("REMOTE_TYPE_NAME"))== "Hybrid Remote", "Hybrid")
  .when(trim(col("REMOTE_TYPE_NAME"))== "Not Remote", "Onsite")
  .when(col("REMOTE_TYPE_NAME").isNull(), "Onsite")
  .otherwise("Onsite")
)

df = df.filter(
  col("MAX_YEARS_EXPERIENCE").isNotNull() &
  col("AVG_SALARY").isNotNull() &
  (col("MAX_YEARS_EXPERIENCE") > 0) &
  (col("AVG_SALARY") > 0 )
)

df_pd = df.select("MAX_YEARS_EXPERIENCE", "AVG_SALARY", "LOT_V6_SPECIALIZED_OCCUPATION_NAME", "REMOTE_GROUP").toPandas()
[Stage 13:>                                                         (0 + 1) / 1]                                                                                

6.2 Scatter Plot

df_pd["MAX_EXPERIENCE_JITTER"] = df_pd["MAX_YEARS_EXPERIENCE"] + np.random.uniform(-0.3, 0.3, size = len(df_pd))
df_pd["AVG_SALARY_JITTER"] = df_pd["AVG_SALARY"] + np.random.uniform(-1500, 1500, size = len(df_pd))

fig = px.scatter(
  df_pd,
  width = 2000,
  height = 800,
  x = "MAX_EXPERIENCE_JITTER",
  y = "AVG_SALARY_JITTER",
  color = "REMOTE_GROUP",
  hover_data=["LOT_V6_SPECIALIZED_OCCUPATION_NAME"],
  title = "Experience vs Salary by Remote Work Type",
  opacity = 0.7,
  color_discrete_sequence=["palevioletred", "mediumseagreen","lightblue"]
)

fig.update_traces(marker=dict(size=7, line=dict(width=1, color="black")))

fig.update_layout(
  plot_bgcolor = "papayawhip",
  font=dict(family = "Montserrat", size = 14),
  title_font = dict(size = 22, weight = "bold"),
  xaxis_title = "Years of Experience",
  yaxis_title = "Average Salary (USD)",
  legend_title = "Remote Work Type",
  hoverlabel = dict(bgcolor = "white", font_size = 13, font_family = "Montserrat"),
  margin=dict(t=70, l=60, r=60),
  xaxis=dict(
    gridcolor="peru",
    tickmode = 'linear',
    dtick=1
  ),
  yaxis=dict(gridcolor = "peru")
)

fig.show()

This graph shows us that there isn’t a significant difference in average salary across the three different remote work types but those outliers with the highest paying job postings are almost entirely onsite.

6.3 Histograms

6.3.1 Onsite

fig = px.histogram(
    df_pd[df_pd["REMOTE_GROUP"] == "Onsite"],
    x="AVG_SALARY",
    nbins=20,
    title=f"Salary Distribution - Onsite",
    color_discrete_sequence=["palevioletred"]
)

fig.update_layout(
    plot_bgcolor="papayawhip",
    xaxis_title="Average Salary (USD)",
    yaxis_title="Count of Job Postings",
    font=dict(family="Montserrat", size=14)
)
fig.show()

This histogram demonstrates that onsite job postings have an average salary of just over $100K with a few outliers ranging from roughly $300K - $800K.

6.3.2 Remote

fig = px.histogram(
    df_pd[df_pd["REMOTE_GROUP"] == "Remote"],
    x="AVG_SALARY",
    nbins=20,
    title=f"Salary Distribution - Remote",
    color_discrete_sequence=["mediumseagreen"]
)

fig.update_layout(
    plot_bgcolor="papayawhip",
    xaxis_title="Average Salary (USD)",
    yaxis_title="Count of Job Postings",
    font=dict(family="Montserrat", size=14)
)
fig.show()

This histogram paints a similar picture as the onsite histogram with the average salary being just over $100K. The cap for remote positions however seems to be around $300K.

6.3.3 Hybrid

fig = px.histogram(
    df_pd[df_pd["REMOTE_GROUP"] == "Hybrid"],
    x="AVG_SALARY",
    nbins=20,
    title=f"Salary Distribution - Hybrid",
    color_discrete_sequence=["lightblue"]
)

fig.update_layout(
    plot_bgcolor="papayawhip",
    xaxis_title="Average Salary (USD)",
    yaxis_title="Count of Job Postings",
    font=dict(family="Montserrat", size=14)
)
fig.show()

The Hybrid histogram looks very similar to the remote histogram with less lucrative outliers but there is a large difference in the drastic decline in number of job postings after ~$125K. It appears that after that point (or so), the number of job postings dwindles rapidly. It’s also worth noting that there are significantly less overall job postings for hybrid postions than fully remote or fully onsite.